Abstract 



In recent years, data streaming has gained prominence due to advances in technologies that enable many applications to 
generate continuous flows of data. This increases the need to develop algorithms that are able to efficiently process data 
streams. Additionally, real-time requirements and evolving nature of data streams make stream mining problems, including 
clustering, challenging research problems. 

In this paper, we propose a one-pass streaming soft clustering (membership of a point in a cluster is described by a distribution) 
algorithm which approximates the "soft" version of the k-means objective function. Soft clustering has applications in 
various aspects of databases and machine learning including density estimation and learning mixture models. We first achieve a 
simple pseudo-approximation in terms of the "hard" k-means algorithm, where the algorithm is allowed to output more than k 
centers. We convert this batch algorithm to a streaming one (using an extension of the k-means++ algorithm recently proposed) 
in the "cash register" model. We also extend this algorithm when the clustering is done over a moving window in the data 
stream. 



The problem of clustering a group of data items into similar groups is one of the most widely studied research problems 
with applications in databases, machine learning and computational geometry. Given a set of points and pairwise distance 
(or similarity) between the points, clustering algorithms divide the points into sets such that points in each set are "close" or 
"similar" with respect to an objective function. Clustering problems arise in two main flavors - hard clustering, where each 
point's membership is exclusively to a single cluster, and soft clustering, where membership of a point in a cluster is described 
by a distribution. 

Often, clustering problems arise in a geometric setting, where the data items are points in high-dimensional Euclidean space. 
In such a setting, it is natural to define the distance between two points as the Euclidean distance between them. In this paper, 
we will assume this setting for the clustering problem. One of the most popular definitions for clustering is the k-means problem 
which is defined as follows. Given an integer k and a set of n data points $X \subset \mathbb{R}^d$, the objective of the k-means problem is to give 
k centers C so as to minimize the objective function 

$$\phi = \sum_{x \in X} \min_{c \in C} d(x, c)^2,$$

where $d(x, c) = \|x - c\|$. 

Estimating parameters of a distribution from sampled data is one of the oldest and most general problems of statistical 
inference. Given a number of samples, one needs to choose a distribution that best fits the observed data. While traditionally 
theoretical analysis in the statistical literature has concentrated on rates (e.g., minimax rates), in recent years the computational 
aspects of this problem, especially the dependence on the dimension of the space, have attracted attention. This effort has been 
particularly directed at the family of Gaussian Mixture models due to their simple formulation and widespread use in several 
applications spanning databases, computer vision, and machine learning. There are strong connections between learnability of 
Gaussian mixtures and clustering GJH). In this context, clustering appears in its "soft" form. 

In soft clustering, each data point is assigned to several clusters partially. For each point x we have a coefficient $u_i(x)$ giving the 
degree of being in the i-th cluster. Usually, the sum of those coefficients for any given x is defined to be 1, $\sum_{i=1}^{k} u_i(x) = 1$. 
The objective of the soft k-means problem is to give k centers C so as to minimize the potential function 

$$\Phi = \sum_{x \in X} \sum_{i=1}^{k} u_i(x)\, d(x, c_i)^2.$$


1 Introduction 
With the explosive growth of financial, social and scientific data sources, it becomes increasingly important to design 
clustering algorithms which can process the data in a streaming fashion. In the data stream model of computation, the points are 
read in a sequence and we desire to compute a function, clustering in our case, on the set of points seen so far. This is called the 
cash register model. In typical applications, the total volume of data is very large and cannot be stored in its entirety. Another 
model for streaming is the moving window model, where the function is computed only on the L most recent points seen 
in the stream. This is typically of more interest in a practical setting. However, with both insertion and deletion of points, 
algorithm design becomes more challenging. 

1.1 Our Contributions 

Our main result establishes the relationship between hard and soft clustering. For a particular form of soft clustering, known 
in the literature as fuzzy k-means (or fuzzy c-means), we show that k-means approximates it within a factor $O(f(k))$, where $f(k)$ is a 
polynomial in k whose precise form depends on a parameter in the definition of fuzzy k-means. This result, coupled with 
the $O(\log k)$ approximation result for k-means of Arthur and Vassilvitskii [4], yields the first approximation algorithm for 
fuzzy k-means. 

Ailon et al. [3] extend the k-means++ algorithm to a streaming algorithm (in the cash register model). A secondary result in 
our paper is to adapt their algorithm to the moving window model. A natural consequence of our work 
is a streaming algorithm for soft clustering in both streaming models. 

2 Previous Work 

One of the most popular heuristic algorithms for k-means is Lloyd's algorithm [20], which initially chooses k centers randomly. 
For each input point, the nearest center is identified. Points that choose the same center belong to a cluster. New centers are 
calculated for the clusters by computing the centroid of points within a cluster. This process is repeated until no changes occur. 
It is easy to show that the cost function does not increase during any iteration. Hence, this algorithm converges to a local 
minimum. Its main attractions are its simplicity and speed. However, there is no guarantee on the quality of the obtained 
solution [18]. 

The fastest exact algorithm for the k-means clustering problem was proposed by Inaba et al. [17]. They observed that the 
number of Voronoi partitions of k points in $\mathbb{R}^d$ is $O(n^{kd})$ and so the optimal k-means clustering could be determined exactly 
in time $O(n^{kd+1})$. They also proposed a randomized $(1 + \epsilon)$-approximation algorithm for the 2-means clustering problem 
with running time $O(n/\epsilon^d)$. 

The k-means problem is known to be NP-hard even for k = 2 [13]. Matousek [21] gave the first PTAS for this problem, 
with running time polynomial in n for a fixed k and d. Kanungo et al. [18] proposed an $O(n^3 \epsilon^{-d})$ algorithm that is $(9 + \epsilon)$-competitive 
by adapting the k-median algorithm of Arya et al. Har-Peled and Mazumdar [16] propose a $(1 + \epsilon)$-approximate 
solution to the k-means problem with running time $O(n + k^{k+2} \epsilon^{-(2d+1)k} \log^{k+1} n \log^{k}(1/\epsilon))$. For fixed k and d, they achieve 
linear running time. The algorithm uses a coreset construction by sampling in an exponential grid. Kumar et al. [19] propose 
a simple $(1 + \epsilon)$-approximation scheme with a running time of $O(2^{(k/\epsilon)^{O(1)}} dn)$, for a fixed k. Their idea is to recursively 
approximate the centroid of the largest remaining cluster by trying all subsets of constant size from a sample, followed by 
pruning sufficiently many points from this large cluster. 

Mettu and Plaxton [22] propose a technique called successive sampling to achieve a constant factor approximation for the 
k-median problem. This idea was adapted independently by Ostrovsky et al. [23] and Arthur and Vassilvitskii [4] for the 
k-means problem. The main idea of both these results is to choose the initial centers for Lloyd's algorithm carefully using a clever 
sampling technique. Ostrovsky et al. [23] achieve a constant factor approximation provided the input satisfies an $\epsilon$-separated 
condition. Arthur and Vassilvitskii [4], however, do not make this assumption and achieve an $O(\log k)$-competitive algorithm. 

Soft clustering has applications in various areas spanning databases, statistical inference and machine learning. The 
two most popular versions of soft clustering are the fuzzy k-means [14, 6] and Expectation Maximization (EM) IfTSl algorithms. 
At a high level, both algorithms are rather similar, performing a two-step iterative optimization until convergence. The first 
step, called the expectation (E) step, is an assignment of each data point to clusters or density models as a distribution, and the 
second step, called the maximization (M) step, re-estimates the clusters based on the current assignments. Just like Lloyd's 
algorithm, the iterative optimization procedure may result in a local optimum. Relatively little is known about these methods 
from a theoretical point of view. The problem of giving an approximation algorithm to the fuzzy k-means problem is considered 
open fin . 

Guha et al. [15] provide a streaming algorithm for the k-median problem. In particular, they propose a simple divide-and-conquer 
strategy to give a constant-factor, single-pass approximation in time $O(nk)$ and sublinear $O(n^{\alpha})$ space for constant 
$\alpha > 0$. Charikar et al. [8] gave a constant-factor, single-pass k-center algorithm using $O(nk \log k)$ time and $O(k)$ space. 
Recently, Ailon et al. [3] combined the results of Guha et al. [15] and Arthur and Vassilvitskii [4] to propose an $O(\log k)$ factor 
approximation to the k-means problem. From a practical point of view, Ackermann et al. [1] provide a non-uniform sampling 
approach to obtain small coresets from the data stream to solve the k-means problem. 



3 Problem Definition and Known Results 

3.1 k-means problem 

The k centers (or cluster centers) of a solution define a clustering: all the points closer to a given center than to any other center form a cluster. 
Finding an exact solution to the k-means problem is NP-hard. The well-known Lloyd's algorithm [20] is 
guaranteed to find only a locally optimal solution to the problem, which can often be quite poor. 

3.2 k-means++ algorithm 

The authors of [4] proposed a way of initializing k-means by choosing random starting centers with certain probabilities, which 
gives an $O(\log k)$-competitive algorithm for the k-means problem with a running time of $O(nkd)$. The initial seeding of k-means++ 
is described in the following algorithm. 

1. Choose an initial center $c_1$ uniformly at random from X. 

2. Choose the next center $c_i$, selecting $c_i = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$, where $D(x)$ is the shortest distance from 
a data point x to the closest center we have already chosen. 

3. Repeat Step 2 until we have chosen a total of k centers. 
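
The three seeding steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Matlab code: the names `sq_dist` and `kmeans_pp_seeding` are hypothetical, points are assumed to be tuples of floats, and the fallback to uniform sampling when every point already coincides with a center is an added assumption that the text does not discuss.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeding(points, k, rng=None):
    """Pick k initial centers from `points` by D^2 sampling (Steps 1-3)."""
    rng = rng or random.Random()
    # Step 1: the first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[j] holds D(x_j)^2, the squared distance to the closest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    # Step 3: repeat Step 2 until k centers have been chosen.
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: all points coincide with a center, so sample uniformly.
            nxt = rng.choice(points)
        else:
            # Step 2: sample x' with probability D(x')^2 / sum_x D(x)^2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers
```

Because a point that is already a center has weight zero, it cannot be drawn again while any other point remains at positive distance.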

3.3 k-means# Algorithm 

The authors of [3] extended the k-means++ algorithm to give an algorithm that outputs $O(k \log k)$ centers and is $O(1)$-competitive 
for k-means with constant probability. This algorithm chooses $3 \log k$ centers randomly in the first round. 
Then, $3 \log k$ centers are chosen as in Step 2 of k-means++, and this is repeated $k - 1$ times as in the k-means++ algorithm. Since 
the guarantee holds only with constant probability, the algorithm needs to be repeated sufficiently many times to strengthen it. 
For instance, the authors of [3] repeat the algorithm $O(\log n)$ times, so that the best run fails to be competitive with probability at most 
$O(1/n)$. 
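
A sketch of a single run of this batched seeding is given below, under stated assumptions: the function name `kmeans_sharp` is hypothetical, the logarithm is taken as the natural logarithm and rounded up (the text does not fix the base), and the repetition over $O(\log n)$ runs is left to the caller.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sharp(points, k, rng=None):
    """One run of batched D^2 seeding: k rounds of about 3 log k centers each."""
    rng = rng or random.Random()
    # Assumption: natural log, rounded up, and at least one center per round.
    batch = max(1, math.ceil(3 * math.log(k)))
    # Round 1: a batch of centers chosen uniformly at random.
    centers = [rng.choice(points) for _ in range(batch)]
    d2 = [min(sq_dist(p, c) for c in centers) for p in points]
    # Rounds 2..k: each round draws a batch by D^2 sampling (Step 2 of k-means++).
    for _ in range(k - 1):
        if sum(d2) == 0:
            break  # every point already coincides with some center
        new = rng.choices(points, weights=d2, k=batch)
        centers.extend(new)
        d2 = [min(old, min(sq_dist(p, c) for c in new))
              for old, p in zip(d2, points)]
    return centers
```

To follow the repetition described above, one would call `kmeans_sharp` several times and keep the run with the smallest k-means cost.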

3.4 Streaming k-means 

A streaming version of k-median was provided in [15]. This idea was used by the authors of [3] to provide a streaming 
version of k-means. A multi-level algorithm is used for a given memory of order $n^{\alpha}$. In all but the last level, $n^{\alpha}$ data points are 
compressed to $O(k \log k)$ points using the k-means# algorithm (taking the best run of $O(\log n)$ trials of k-means#). In the last level, $n^{\alpha}$ data 
points are compressed to k points using the k-means++ algorithm. The guarantees of this algorithm are summarized in 
the following theorem. 

Theorem 3.1 ([3]). If there is access to memory of size $M = n^{\alpha}$ for some fixed $\alpha > 0$, then for sufficiently large n the best 
application of the multi-level scheme described above is obtained by running $r = O(\log(n/M) / \log(M / k \log k))$ levels (which is 
constant), and choosing the repeated k-means# for all but the last level, in which k-means++ is chosen. The resulting algorithm 
is a randomized streaming approximation to k-means, which is $O(\log k)$-competitive. Its running time is $O(dnk^2 \log n \log k)$. 
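
To make the multi-level scheme concrete, here is a deliberately simplified sketch with a single intermediate level. It is an illustration under assumptions, not the algorithm of [3] as analyzed: each block of `block_size` points is compressed by one run of batched seeding (rather than the best of $O(\log n)$ runs), each selected center carries the number of block points nearest to it as a weight, and the last level applies weighted k-means++ seeding to the collected summary. All function names are hypothetical.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_seed(points, weights, rounds, batch, rng):
    """Weighted D^2 seeding: batch=1 is k-means++ style, batch ~ 3 log k is k-means# style."""
    # First round: sample proportionally to the point weights (an assumption).
    centers = rng.choices(points, weights=weights, k=batch)
    d2 = [min(sq_dist(p, c) for c in centers) for p in points]
    for _ in range(rounds - 1):
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:
            break
        new = rng.choices(points, weights=probs, k=batch)
        centers.extend(new)
        d2 = [min(old, min(sq_dist(p, c) for c in new))
              for old, p in zip(d2, points)]
    return centers

def compress(points, centers):
    """Give each distinct center a weight equal to the number of points nearest to it."""
    uniq = list(dict.fromkeys(centers))  # drop repeated centers, keep order
    mass = [0.0] * len(uniq)
    for p in points:
        j = min(range(len(uniq)), key=lambda i: sq_dist(p, uniq[i]))
        mass[j] += 1.0
    return uniq, mass

def stream_kmeans(stream, k, block_size, rng=None):
    """Two-level sketch: compress each block, then seed k centers on the summary."""
    rng = rng or random.Random()
    batch = max(1, math.ceil(3 * math.log(k)))  # assumption: natural log
    summary_pts, summary_wts, block = [], [], []

    def flush():
        cs = d2_seed(block, [1.0] * len(block), k, batch, rng)
        pts, wts = compress(block, cs)
        summary_pts.extend(pts)
        summary_wts.extend(wts)
        block.clear()

    for x in stream:
        block.append(x)
        if len(block) == block_size:
            flush()
    if block:
        flush()
    # Last level: weighted k-means++ style seeding on the weighted summary.
    return d2_seed(summary_pts, summary_wts, k, 1, rng)
```

Only one block and the summary are held in memory at any time, which is the point of the divide-and-conquer structure; adding further levels would compress the summary itself in the same way.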

3.5 Soft k-means 

In hard clustering, each data point is assigned to its closest center. However, in soft clustering, each data point is assigned to 
several clusters partially. For each point x we have a coefficient $u_i(x)$ giving the degree of being in the i-th cluster. Usually, 
the sum of those coefficients for any given x is defined to be 1, $\sum_{i=1}^{k} u_i(x) = 1$. The centroid of a cluster is the mean of all 
points, weighted by their degree of belonging to the cluster, or 

$$c_i = \frac{\sum_{x \in X} u_i(x)\, x}{\sum_{x \in X} u_i(x)}. \qquad (3.1)$$

We will assume following [6] that, 

$$u_i(x) = \frac{d(x, c_i)^{-2/m}}{\sum_{j=1}^{k} d(x, c_j)^{-2/m}}. \qquad (3.2)$$

The objective of the soft k-means problem is to give k centers C so as to minimize the potential function, 

$$\Phi = \sum_{x \in X} \sum_{i=1}^{k} u_i(x)\, d(x, c_i)^2. \qquad (3.3)$$

For m = 1, this is equivalent to normalizing the coefficients linearly to make their sum 1. For $m \to 0$, the cluster centers 
approach the k-means centers. We will assume that $0 < m < 1$. For a given number of clusters k, soft clustering is usually done 
using the EM algorithm, which can be defined as follows. 

1. Choose k centers at random. 

2. Repeat until the algorithm has converged: 

(a) For each point x, compute $u_i(x)$. 

(b) Compute the centroid $c_i$ for each cluster. 
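
The iteration above can be sketched as follows, assuming memberships proportional to $d(x, c_i)^{-2/m}$ and membership-weighted centroids as described in this section. The function names are hypothetical, the initial centers are supplied by the caller (random for EM, or seeded for the variant studied later), and two details are assumptions of this sketch: a point that coincides with a center puts all its membership on that center, and weights are rescaled by the smallest distance to avoid overflow for small m.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def memberships(x, centers, m):
    """u_i(x) proportional to d(x, c_i)^(-2/m), normalised to sum to 1."""
    d2 = [sq_dist(x, c) for c in centers]
    dmin = min(d2)
    if dmin == 0.0:
        # Assumption: all mass goes to the centers that coincide with x.
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    # (dmin / d^2)^(1/m) equals d^(-2/m) up to a common factor, but stays in (0, 1].
    w = [(dmin / d) ** (1.0 / m) for d in d2]
    s = sum(w)
    return [wi / s for wi in w]

def soft_kmeans(points, centers, m, iters=50):
    """Alternate step (a) and step (b) until the centers stop changing."""
    dim = len(points[0])
    for _ in range(iters):
        U = [memberships(x, centers, m) for x in points]      # step (a)
        new_centers = []
        for i in range(len(centers)):                         # step (b)
            tot = sum(u[i] for u in U)
            if tot == 0.0:
                new_centers.append(centers[i])  # keep the center of an empty cluster
                continue
            new_centers.append(tuple(
                sum(u[i] * x[j] for u, x in zip(U, points)) / tot
                for j in range(dim)))
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def potential(points, centers, m):
    """Soft k-means potential: sum over x and i of u_i(x) * d(x, c_i)^2."""
    return sum(u * sq_dist(x, c)
               for x in points
               for u, c in zip(memberships(x, centers, m), centers))
```

For example, with centers at 0 and 3 on the line and m = 0.5, the point x = 1 has squared distances 1 and 4, so its memberships are proportional to 1 and 1/16.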



4 k-means as an approximation for soft k-means 

In this section, we show that the centers obtained by k-means give an $O(k^{m/(1-m)})$-competitive algorithm for soft k-means. 
That is, using the optimal centers of the k-means problem gives an approximation to the soft k-means problem. Further, 
since the k-means++ algorithm is $O(\log k)$-competitive for the k-means problem, k-means++ is an $O(k^{m/(1-m)} \log k)$-competitive 
algorithm for the soft k-means problem. 

Theorem 4.1. The k-means centers give an $O(k^{m/(1-m)})$-competitive algorithm for soft k-means. 
The rest of the section provides the proof of this theorem. 

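Let $c_1^*, \ldots, c_k^*$ be the optimal k centers of the k-means problem. Then, the objective function of soft k-means with these 
centers is given by, 

$$\Phi(c^*) = \sum_{x \in X} \sum_{i=1}^{k} u_i(x)\, d(x, c_i^*)^2 = \sum_{x \in X} \frac{\sum_{i=1}^{k} d(x, c_i^*)^{-2(1/m-1)}}{\sum_{i=1}^{k} d(x, c_i^*)^{-2/m}}. \qquad (4.1)$$

Note that 

$$\sum_{i=1}^{k} d(x, c_i^*)^{-2/m} = \sum_{i=1}^{k} \left( d(x, c_i^*)^{-2(1/m-1)} \right)^{\frac{1/m}{1/m-1}} \ge \frac{1}{k^{\frac{1}{1/m-1}}} \left( \sum_{i=1}^{k} d(x, c_i^*)^{-2(1/m-1)} \right)^{\frac{1/m}{1/m-1}}. \qquad (4.2)$$

This is because $\sum_{i=1}^{k} a_i^d \ge \frac{1}{k^{d-1}} \left( \sum_{i=1}^{k} a_i \right)^d$ for $d \ge 1$ and non-negative $a_i$ (here d denotes an exponent, not the distance). This is true since for a convex function, $f(\mathbb{E}X) \le \mathbb{E}f(X)$. Using the 
uniform discrete distribution over the $a_i$ and letting $f(x) = x^d$, we get the above result. 
Substituting in (4.1), we get 

$$\Phi(c^*) \le k^{\frac{1}{1/m-1}} \sum_{x \in X} \left( \sum_{i=1}^{k} d(x, c_i^*)^{-2(1/m-1)} \right)^{1 - \frac{1/m}{1/m-1}}. \qquad (4.3)$$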
Substituting $1/m - 1 = g$, we get, 

$$\begin{aligned}
\Phi(c^*) &\le k^{1/g} \sum_{x \in X} \left( \sum_{i=1}^{k} d(x, c_i^*)^{-2g} \right)^{-1/g} \\
&\overset{(1)}{\le} k^{1/g} \sum_{x \in X} \left( \max_{i \in \{1, \ldots, k\}} d(x, c_i^*)^{-2g} \right)^{-1/g} \\
&\overset{(2)}{=} k^{1/g} \sum_{x \in X} \left( \left( \min_{i \in \{1, \ldots, k\}} d(x, c_i^*) \right)^{-2g} \right)^{-1/g} \\
&= k^{1/g} \sum_{x \in X} \min_{i \in \{1, \ldots, k\}} d(x, c_i^*)^2, \qquad (4.4)
\end{aligned}$$

where (1) follows since the sum of non-negative terms is at least as large as the maximum of the terms and g > 0, and (2) follows 
since g > 0. 

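Note that this is the objective function of k-means and the centers $c_i^*$ are optimal for this problem. Thus, for any k centers 
$c_i$, $1 \le i \le k$, we have 

$$\begin{aligned}
\Phi(c^*) &\le k^{1/g} \sum_{x \in X} \min_{i \in \{1, \ldots, k\}} d(x, c_i^*)^2 \\
&\le k^{1/g} \sum_{x \in X} \min_{i \in \{1, \ldots, k\}} d(x, c_i)^2 \\
&= k^{1/g} \sum_{x \in X} \min_{i \in \{1, \ldots, k\}} d(x, c_i)^2 \, \frac{\sum_{j=1}^{k} d(x, c_j)^{-2/m}}{\sum_{j=1}^{k} d(x, c_j)^{-2/m}} \\
&= k^{1/g} \sum_{x \in X} \frac{\sum_{j=1}^{k} d(x, c_j)^{-2/m} \min_{i \in \{1, \ldots, k\}} d(x, c_i)^2}{\sum_{j=1}^{k} d(x, c_j)^{-2/m}} \\
&\overset{(3)}{\le} k^{1/g} \sum_{x \in X} \frac{\sum_{j=1}^{k} d(x, c_j)^{-2/m}\, d(x, c_j)^2}{\sum_{j=1}^{k} d(x, c_j)^{-2/m}} \\
&= k^{1/g}\, \Phi(c), \qquad (4.5)
\end{aligned}$$

where (3) follows since $\min_{i \in \{1, \ldots, k\}} d(x, c_i)^2 \le d(x, c_j)^2$ for every j. 

Since the above holds for any centers $c_i$, it holds in particular for the optimal centers of the soft k-means problem. Thus, 
the centers of k-means are at most $k^{1/g} = k^{m/(1-m)}$-competitive for soft k-means. 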

We can further see that using the k-means++ centers gives an additional $\log k$ factor in the approximation. This is because, taking the 
centers $c'_i$ of k-means++ rather than those of k-means, all the steps up to (4.4) hold directly. Also, since k-means++ is $O(\log k)$-competitive, 
$\sum_{x \in X} \min_{i \in \{1, \ldots, k\}} d(x, c'_i)^2 \le O(\log k) \sum_{x \in X} \min_{i \in \{1, \ldots, k\}} d(x, c_i)^2$, which adds an extra $O(\log k)$ factor to the final result. 
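
As the remark above notes, the chain of inequalities up to (4.4) does not use optimality of the centers. Together with step (3) of (4.5), this suggests a simple numerical sanity check: for any set of k centers, the hard k-means cost should be at most the soft potential, which in turn should be at most $k^{m/(1-m)}$ times the hard cost. The sketch below checks this on arbitrary data; the function names are hypothetical and the handling of points that coincide with a center is an assumption of the sketch.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hard_cost(points, centers):
    """k-means objective: sum over x of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)

def soft_potential(points, centers, m):
    """Soft potential with u_i(x) proportional to d(x, c_i)^(-2/m)."""
    total = 0.0
    for x in points:
        d2 = [sq_dist(x, c) for c in centers]
        dmin = min(d2)
        if dmin == 0.0:
            continue  # assumption: x sits on a center and contributes zero
        w = [(dmin / d) ** (1.0 / m) for d in d2]  # rescaled to avoid overflow
        total += sum(wi * d for wi, d in zip(w, d2)) / sum(w)
    return total

def check_bound(points, centers, m):
    """Return (hard cost, soft potential, k^(m/(1-m)) * hard cost) for 0 < m < 1."""
    k = len(centers)
    hard = hard_cost(points, centers)
    soft = soft_potential(points, centers, m)
    return hard, soft, k ** (m / (1.0 - m)) * hard
```

For instance, with k = 5 and m = 0.5 the factor is $5^{1} = 5$, while for m = 0.1 it is $5^{1/9}$, reflecting that the soft and hard objectives approach each other as m tends to 0.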



5 Streaming Soft k-means 

In the previous section, we saw that k-means++ based initialization gives an approximation for soft k-means. This algorithm 
has been adapted for streaming in the cash register model in [3]. The same algorithm can be used for soft k-means, and the 
result in Theorem 3.1 holds as an approximation to soft k-means. The statement adapted to soft k-means is as follows. 

Lemma 5.1. If there is access to memory of size $M = n^{\alpha}$ for some fixed $\alpha > 0$, then for sufficiently large n the best 
application of the multi-level scheme described above is obtained by running $r = O(\log(n/M) / \log(M / k \log k))$ levels (which 
is constant), and choosing the repeated k-means# for all but the last level, in which k-means++ is chosen. The resulting 
algorithm is a randomized streaming approximation to soft k-means, which is $O(k^{m/(1-m)} \log k)$-competitive. Its running time is 
$O(dnk^2 \log n \log k)$. 

This streaming algorithm can also be adapted to streaming over a sliding window when the memory is limited. In this 
model, the k-means centers are needed over a moving window, so past data must be removed, unlike the cash 
register model where new data keeps being added and old data need not be removed. 

Suppose that the sliding window length is L and the memory is $O\left( (L k^t (\log k)^t)^{\frac{1}{t+1}} \right)$. Then the cash register model is used with 
t + 1 levels, with every M points converted to $3k \log k$ centers in the first t steps and M points converted to k points in the last 
step. Keep the M points at the t-th level rather than throwing them away after converting them to k points at level t + 1, as is done in the cash 
register model. 

Shifting the window by a length of $L^{\frac{t-1}{t+1}} (3k \log k)^{\frac{2}{t+1}}$ affects only the first $3k \log k$ points at the t-th level. So, 
after this window shift, remove the first $3k \log k$ points at the t-th level and add $3k \log k$ points at the t-th level, namely the $3k \log k$ 
centers given by the k-means# algorithm for the M points at the $(t-1)$-th level. Thus, we again have M points at the t-th level, which give 
the required k centers. Since at this window shift the k centers are computed exactly on the sliding window, with no extra 
or omitted data point, the algorithm is $O(\log k)$-competitive for k-means. If centers are needed at an intermediate shift, we can 
take the last $M - 3k \log k$ points at the t-th level in full and weight the first $3k \log k$ points at the t-th level 
and the points at the $(t-1)$-th level according to the shift. 

Theorem 5.1. The above algorithm with a memory of $O(L^{\epsilon} (k \log k)^{1-\epsilon})$ gives an $O(\log k)$-competitive algorithm for k-means 
for a sliding window of length L at every window shift of $L^{1-2\epsilon} (3k \log k)^{2\epsilon}$. 

Proof. Using $\epsilon = \frac{1}{t+1}$, we get the above memory requirement. Further, at a window shift of $L^{1-2\epsilon} (3k \log k)^{2\epsilon}$, at the t-th level, 
the first $3k \log k$ points all leave the window and $3k \log k$ new points are added. There is no data point at any but the $(t+1)$-th 
level, and the k centers at the $(t+1)$-th level are $O(\log k)$-competitive. □ 

Since the k-means centers are approximate centers for soft k-means, we can have the following approximation for soft 
k-means over a sliding window. 

Corollary 1. The above algorithm with a memory of $O(L^{\epsilon} (k \log k)^{1-\epsilon})$ gives an $O(k^{m/(1-m)} \log k)$-competitive algorithm for 
soft k-means for a sliding window of length L at every window shift of $L^{1-2\epsilon} (3k \log k)^{2\epsilon}$. 
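
To get a feel for the quantities in Theorem 5.1, the following sketch simply evaluates the two expressions for given L, k and $\epsilon$. It is only arithmetic on the stated formulas: the function name is hypothetical, the constant hidden in the $O(\cdot)$ memory bound is taken to be 1, and the logarithm is assumed to be natural.

```python
import math

def sliding_window_parameters(L, k, eps):
    """Evaluate the memory and window-shift expressions of Theorem 5.1.

    Assumptions: natural logarithm, and a hidden constant of 1 in the memory bound.
    """
    klogk = k * math.log(k)
    memory = L ** eps * klogk ** (1 - eps)
    shift = L ** (1 - 2 * eps) * (3 * klogk) ** (2 * eps)
    return memory, shift
```

With $\epsilon = 1/2$ the memory expression becomes $\sqrt{L\, k \log k}$ and the shift expression becomes $3k \log k$, independent of L; smaller $\epsilon$ trades less memory for longer shifts between exact recomputations.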



6 Empirical Results 

In order to evaluate k-means++ initialization in practice, we implemented the two algorithms in Matlab. We label the original 
algorithm as EM and the one based on k-means++ initialization of centers as EM++. The code is not optimized and is 
available at [2]. We found that the seeding substantially improves both the running time and the accuracy of EM. 

We chose two datasets that are also used in [4]. The first is the Spam dataset [24], which consists of 4601 points in 
58 dimensions. The second is the Cloud dataset [ 1 1, which consists of 1024 points in 10 dimensions. For each dataset, we tested 
k = 10, 25, and 50 and m = 0.1, 0.25 and 0.5. 

Since we are testing a randomized seeding process, we ran 20 trials for each case. The minimum and average potential and the 
mean running time are compared between EM and EM++. 

We find that seeding improves both the running time and the potential in almost all cases on these two datasets. The one exception 
is the Cloud dataset with m = 0.5 and k = 25, where EM++ is slower and leaves the potential essentially unchanged. 







                  Average Φ                  Minimum Φ                  Average T
m      k      EM              EM++       EM              EM++       EM         EM++
0.1    10     1.665 x 10^8    48.08%     1.016 x 10^8    23.49%     34.155     37.21%
0.1    25     1.196 x 10^8    85.63%     5.36 x 10^7     70.88%     89.256     16.87%
0.1    50     6.304 x 10^7    89.91%     7.763 x 10^6    23.27%     231.594    17.2%
0.25   10     1.748 x 10^8    49.63%     1.076 x 10^8    22.13%     37.401     28.25%
0.25   25     8.244 x 10^7    78.92%     1.666 x 10^7    2.01%      138.632    5.52%
0.25   50     7.838 x 10^6    16.74%     6.73 x 10^6     9.13%      258.503    21.42%
0.5    10     2.916 x 10^8    60.81%     2.329 x 10^8    51.11%     72.826     57.96%
0.5    25     4.325 x 10^7    38.77%     2.658 x 10^7    7.37%      218.268    9.63%
0.5    50     1.202 x 10^7    7.08%      1.09 x 10^7     3.58%      645.887    47%

Table 1: Experimental results on the Spam dataset (n=4601, d=58). For EM, we list the actual potential and time in seconds. 
For EM++, we list the percentage improvement over EM: 100% x (1 - EM++ value / EM value). 

                  Average Φ            Minimum Φ            Average T
m      k      EM        EM++       EM        EM++       EM         EM++
0.1    10     6.462     8.74%      6.319     8.28%      3.742      24.65%
0.1    25     2.403     12.27%     2.226     10.09%     35.456     61.02%
0.1    50     1.704     33.85%     1.41      22.99%     36.232     48.47%
0.25   10     6.682     7.28%      6.407     6.25%      6.401      27.04%
0.25   25     2.318     5.37%      2.137     2.13%      28.986     36.2%
0.25   50     1.256     4.16%      1.197     3.7%       56.295     40.97%
0.5    10     8.762     10.2%      8.762     12.75%     5.142      26.45%
0.5    25     3.389     0.072%     3.287     0%         17.087     -143.93%
0.5    50     2.339     1.65%      2.313     3.76%      90.739     18.41%

Table 2: Experimental results on the Cloud dataset (n=1024, d=10). For EM, we list the actual potential divided by 10^8 and 
time in seconds. For EM++, we list the percentage improvement over EM: 100% x (1 - EM++ value / EM value). 